Part 1: National Survey on Drug Use and Health data

Drug addiction is commonly referred to as an illness that doesn’t discriminate which is why i’m interested to see how geographic distribution (sometimes characterized by income, racial, and demographic strata) would aid in visualizing incidences of drug use. Admittedly, three of the data sets I will be using from the NSDUH database only estimate incidences of drug use which does not necessarily imply a dependence to substances, but I am also interested in mapping out the geographic distribution of drug use estimates in order to later compare it to drug arrest rates across different county locations. I will be focusing on heroin, cocaine, and marijuana use in the state of New York from the years of 2014 to 2016. The reason why I am not using a more recent dataset is because the NSDUH has only released datasets that estimate drug use starting from the age of 12, which would hamper the comparison with the New York arrest data since the latter describes individuals who are 18 or older.

## # A tibble: 3 Ă— 7
##   outcome                   age_group years geography estimate ci_lower ci_upper
##   <chr>                     <chr>     <chr> <chr>     <chr>    <chr>    <chr>   
## 1 Cocaine Use in the Past … 12 or Ol… 2014… West      0.02227… 0.02052… 0.02416…
## 2 Cocaine Use in the Past … 12 to 17  2014… West      0.00891… 0.00704… 0.01126…
## 3 Cocaine Use in the Past … 18 or Ol… 2014… West      0.02366… 0.02176… 0.02571…
## # A tibble: 3 Ă— 7
##   outcome                   age_group years geography estimate ci_lower ci_upper
##   <chr>                     <chr>     <chr> <chr>     <chr>    <chr>    <chr>   
## 1 Heroin Use in the Past Y… 12 or Ol… 2014… West      0.00297… 0.00235… 0.00375…
## 2 Heroin Use in the Past Y… 12 to 17  2014… West      0.00089… 0.00053… 0.00149…
## 3 Heroin Use in the Past Y… 18 or Ol… 2014… West      0.00319… 0.00252… 0.00404…
## # A tibble: 3 Ă— 7
##   outcome                   age_group years geography estimate ci_lower ci_upper
##   <chr>                     <chr>     <chr> <chr>     <chr>    <chr>    <chr>   
## 1 Marijuana Use in the Pas… 12 or Ol… 2014… West      0.16191… 0.15640… 0.16757…
## 2 Marijuana Use in the Pas… 12 to 17  2014… West      0.14199… 0.13453… 0.14981…
## 3 Marijuana Use in the Pas… 18 or Ol… 2014… West      0.16398… 0.15803… 0.17012…

I started off by loading all four data sets. All variables seem to be identical which is great since I can create a function that cleans all of my data sets by sub-setting the geography variable to only include regions with the word “New York” in them, using str_detect(). I am also filtering out entries in the age_group variable I am not interested in (12 or older, and 12 to 17). I have also manually checked that the function cleans each data set correctly. The age_group entry for 18 or Older would be plentiful for my analysis but I will also include the 18 to 25 and 26 or Older entries as it might be interesting to visualize the nuances of drug use among these two different age groups.

Cleaning Data

## # A tibble: 11 Ă— 7
##    outcome                  age_group years geography estimate ci_lower ci_upper
##    <chr>                    <chr>     <chr> <chr>     <chr>    <chr>    <chr>   
##  1 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.11629… 0.09561… 0.14076…
##  2 Marijuana Use in the Pa… 18 to 25  2014… New York… 0.26630… 0.21776… 0.32120…
##  3 Marijuana Use in the Pa… 26 or Ol… 2014… New York… 0.08589… 0.06676… 0.10984…
##  4 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.23101… 0.20122… 0.26375…
##  5 Marijuana Use in the Pa… 18 to 25  2014… New York… 0.48161… 0.42278… 0.54095…
##  6 Marijuana Use in the Pa… 26 or Ol… 2014… New York… 0.19409… 0.16325… 0.22917…
##  7 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.17636… 0.15326… 0.20213…
##  8 Marijuana Use in the Pa… 18 to 25  2014… New York… 0.34904… 0.30077… 0.40063…
##  9 Marijuana Use in the Pa… 26 or Ol… 2014… New York… 0.14731… 0.12379… 0.17441…
## 10 Marijuana Use in the Pa… 18 or Ol… 2014… New York… 0.12289… 0.09715… 0.15427…
## 11 Marijuana Use in the Pa… 18 to 25  2014… New York… suppres… suppres… suppres…

Some of the entries for specific age groups in each of the drug use datasets have been suppressed by the NSDUH (as an example I included the first 11 rows of the marijuana dataset, the last row has missing entries for estimates), therefore I may not be able to visualize trends across age groups as specifically as I had hoped.

Geographic variables

The NSDUH provides a manual for their substate region definitions and their respective counties, so I have created a dataframe for each of the 11 regions with a geography variable that I will later be able to join with the datasets I just cleaned, as well as the corresponding counties.

In order to retrieve the geometric data to create a map of New York State, I used the “tigris” package in order to save that information under a dataframe called new_york_county. This will be useful later on for creating my map.

## Simple feature collection with 2 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.98328 ymin: 40.78536 xmax: -73.48271 ymax: 41.36654
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP COUNTYNS        GEOIDFQ GEOID        NAME           NAMELSAD
## 1      36      005 00974101 0500000US36005 36005       Bronx       Bronx County
## 2      36      119 00974157 0500000US36119 36119 Westchester Westchester County
##   STUSPS STATE_NAME LSAD      ALAND    AWATER                       geometry
## 1     NY   New York   06  109235672  39353304 MULTIPOLYGON (((-73.77242 4...
## 2     NY   New York   06 1115811940 179391718 MULTIPOLYGON (((-73.77237 4...

I then join the dataframe to my combined_regions dataframe under a new dataframe named new_york_county_regions which will include the county geometry entries so I can map them, as well as the previously mentioned geography variable which will allow me to join the NSDUH data for regional estimates of drug use.

## Simple feature collection with 2 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.98328 ymin: 40.78536 xmax: -73.48271 ymax: 41.36654
## Geodetic CRS:  NAD83
##   STATEFP COUNTYFP COUNTYNS        GEOIDFQ GEOID        NAME           NAMELSAD
## 1      36      005 00974101 0500000US36005 36005       bronx       Bronx County
## 2      36      119 00974157 0500000US36119 36119 westchester Westchester County
##   STUSPS STATE_NAME LSAD      ALAND    AWATER                     geography
## 1     NY   New York   06  109235672  39353304     New York Region 2A: Bronx
## 2     NY   New York   06 1115811940 179391718 New York Region 3: Mid-Hudson
##                         geometry
## 1 MULTIPOLYGON (((-73.77242 4...
## 2 MULTIPOLYGON (((-73.77237 4...

As a form of initial visualization, I plotted each drug use dataset according to age group and there are many missing entries for more specific age groupings (18 to 25 and 26 or Older). The Heroin dataset has missing values across all age groups, even the 18 or Older group which I had hoped to use for my final map. I had to disaggregate my datasets according to each age group since I had trouble with the join functions when those variables weren’t separated. I was hoping to use the facet_wrap() function for brevity but had to instead create maps for each of the age groups. As a brief observation, the drug use percentages appear to be higher in proportion for the 18 to 25 subgroup which is interesting to note. However, I cannot firmly come to that conclusion due to the numerous missing region entries for that subgroup.

Part 2: New York Arrest Data

This next part is related to the dataset from the New york State Department’s arrest data for each of New york’s counties from the years of 2013 up to 2022. Because the document is in excel format, I will skip the first three rows when reading the file since those cells are empty and weren’t formatted correctly. I then filter the data to exclude non-drug arrests and create a new variable called category which distinguishes between felony and misdemeanor arrests since those variables were incorrectly formatted when reading the excel sheet.

Here are plots of total drug arrests and a plot that distinguishes between felony and misdemeanor arrests. Counties such as Kings, Queens, Suffolk, and Erie stand out signicantly in terms of high arrest numbers.

Part 3: Putting it all together

I first plotted the cocaine use dataset using the leaflet package which allowed me to also plot the county drug arrests on top (red circles) for better visualization. I was having some trouble with plotting the arrest data since the geometry data wasn’t recognized as an sf so after trial and error, I did some research and found that the st_centroid() function from the sf package would allow me to plot the data as I had hoped. The arrest data was incredibly skewed with very large arrest numbers in counties such as Erie, Bronx, and counties mostly in regions 1 and 2 so I also scaled the diameter of the circles by taking the square root of the total arrests and dividing that by 10.

Before comparing the two data with one another, it is important to note that there are many limitations. First, I am only plotting cases of specific drug use against much less specific drug arrest data. Further, some drug arrests may be related to an intent to distribute which doesn’t always coincide with personal drug use. Finally, this visual may be a bit misleading since it doesn’t take into account population size, metropolitan areas (albeit some can be intuitively identified for well-known counties in New York City). It also compares regional estimates for counties lumped together with arrests that are specific to counties so there are likely some lost aspects between specific counties and their drug trends.

With that said, regions 2B(Kings), 8, 9, and 5 show the highest estimates of cocaine use for individuals aged 18 or older (shown on the map in light purple). The arrest circles seem to be significantly larger In Erie county (region 11, outer left), Queens, Richmond, Kings (region 2) and some areas in some areas in region 1 (bottom right). This is interesting to see given that those regions show the lowest percentages of cocaine use, relative to other New York regions but perhaps dense populations (especially in region 2) may account for that disproportional trend.

As for marijuana use, regions 4, followed by region 9, and region 2B (Kings) now show the highest estimates of marijuana use among those 18 or older. The plot seems to be lighter in color overall (indicating higher use rates) with the exception of regions 3 and 7 which have lower percentages relative to the rest of the regions. The lighter regions just mentioned seem to have medium-sized arrest circles, however, Queens which also has a relatively high marijuana use rate seems to also have a very large arrest rate, relative to those two regions that show similar percentages. Again, this may be attributed to the area’s dense population.

This last plot is obviously missing two regions (regions 5 and 6) since the data was repressed by the NSDUH but we can still visualize disproportinate trends of heroin use in the remaining regions. Region 8 now shows the highest percentage of heroin use (center, pale yellow), followed by region 11 (left, light orange), region 2B (Kings, light orange), and then regions 2C (New York county), 4 and 7 (pink). This is the first instance in which high rates of drug use (as previously noted in Erie county, region 2 and region 1) seem to coincide with large arrest circles. Again, this is not necessarily related but it is interesting to note that region 8 has the highest rate of heroin use but doesn’t show the same large arrest rates, relative to the previously mentioned regions. Once again, this is probably related to population size, and density.